feat(tsdb): per-tenant metric cardinality fairness#51

Merged
aksOps merged 1 commit into main from feat/per-tenant-cardinality
Apr 27, 2026

@aksOps aksOps commented Apr 27, 2026

Summary

Phase 2 of the multi-phase 150–200 component robustness push. Adds per-tenant fairness to the in-memory TSDB cardinality budget so a noisy tenant cannot starve siblings.

  • New METRIC_MAX_CARDINALITY_PER_TENANT env var (default 0 = unlimited; preserves single-tenant legacy behavior).
  • Per-tenant cap is checked first; the existing global METRIC_MAX_CARDINALITY becomes a backstop.
  • Per-tenant overflow buckets are tenant-scoped (key suffix |<tenant>) so overflow stats don't merge across tenants.
  • New labeled metric otelcontext_tsdb_cardinality_overflow_by_tenant_total{tenant_id}. The tenant_id label takes the sentinel value __global__ when the global cap, rather than the per-tenant cap, was the trigger. The existing unlabeled OtelContext_tsdb_cardinality_overflow_total is preserved for back-compat dashboards.
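
The precedence described in the bullets above can be sketched as a pure decision function. This is a minimal illustration, not the code from this PR: `overflowLabel`, `overflowKey`, and the `"overflow"` base key are hypothetical names, and using the tenant ID itself as the label value when the per-tenant cap trips is an assumption inferred from the `__global__` sentinel description.

```go
package main

import "fmt"

// globalSentinel is the label value the PR describes for global-cap overflow.
const globalSentinel = "__global__"

// overflowLabel reports which tenant_id label value an attempt to create a
// NEW series would be counted under. An empty string means the series is
// admitted. A limit of 0 means "unlimited", matching the env-var defaults.
// (Hypothetical helper; the real check lives inside the aggregator.)
func overflowLabel(tenant string, tenantSeries, totalSeries, perTenant, global int) string {
	// Per-tenant cap is checked first.
	if perTenant > 0 && tenantSeries >= perTenant {
		return tenant
	}
	// Global cap is the backstop.
	if global > 0 && totalSeries >= global {
		return globalSentinel
	}
	return ""
}

// overflowKey scopes an overflow bucket to one tenant using the |<tenant>
// suffix, so overflow stats from different tenants never share a bucket.
// The base key is an assumption of this sketch.
func overflowKey(base, tenant string) string {
	return base + "|" + tenant
}

func main() {
	fmt.Println(overflowLabel("a", 2000, 5000, 2000, 10000)) // a
	fmt.Println(overflowLabel("b", 10, 10000, 2000, 10000))  // __global__
	fmt.Println(overflowKey("overflow", "a"))                // overflow|a
}
```

Checking the per-tenant budget first means a noisy tenant is attributed by name before the shared pool is consulted; the sentinel only appears when the tenant was within its own budget.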

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test -race ./... — all 13 packages pass (tsdb tests added)
  • 8 new tsdb tests cover: zero-config baseline, legacy global-only behavior, per-tenant fairness (tenant A exhausts, tenant B unaffected), per-tenant overflow buckets stay separate, flush resets counts, both caps coexist with correct precedence (per-tenant first, global backstop), default behavior unchanged, overflow bucket stat accumulation
  • Aggregator.SetCardinalityLimit API change wired through main.go and the test file; no callers outside the package

Behavior matrix

| METRIC_MAX_CARDINALITY | METRIC_MAX_CARDINALITY_PER_TENANT | Effect |
|---|---|---|
| 0 | 0 | No enforcement (tests verify) |
| 10000 (default) | 0 (default) | Legacy global-only behavior |
| 10000 | 2000 | Each tenant up to 2000 series; total clamped to 10000 |
| 0 | 2000 | Per-tenant only; worst-case total = N tenants × 2000 |
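
The matrix can be read as a small function of the two limits and the tenant count. The sketch below is illustrative only: `limitFromEnv` and `worstCaseTotal` are hypothetical helpers, and falling back to the default for unset, unparseable, or negative values is an assumption, since the PR does not say how invalid input is handled.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// limitFromEnv reads a non-negative integer limit from the environment.
// Falling back to def on unset, unparseable, or negative values is an
// assumption of this sketch, not confirmed behavior.
func limitFromEnv(name string, def int) int {
	raw, ok := os.LookupEnv(name)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil || n < 0 {
		return def
	}
	return n
}

// worstCaseTotal returns the largest number of series that can exist in one
// window for the given tenant count. bounded is false when neither cap is
// set, i.e. the "no enforcement" row of the matrix.
func worstCaseTotal(global, perTenant, tenants int) (total int, bounded bool) {
	switch {
	case global == 0 && perTenant == 0:
		return 0, false
	case perTenant == 0:
		return global, true // legacy global-only behavior
	case global == 0:
		return tenants * perTenant, true // per-tenant only
	default:
		// Both caps set: per-tenant budgets apply, clamped by the global cap.
		if sum := tenants * perTenant; sum < global {
			return sum, true
		}
		return global, true
	}
}

func main() {
	global := limitFromEnv("METRIC_MAX_CARDINALITY", 10000)
	perTenant := limitFromEnv("METRIC_MAX_CARDINALITY_PER_TENANT", 0)
	total, bounded := worstCaseTotal(global, perTenant, 8)
	fmt.Println(total, bounded)
}
```

With 8 tenants, the third row (10000 and 2000) stays clamped at 10000, while the fourth row (0 and 2000) grows to 16000, which is why the per-tenant-only configuration has no fixed ceiling.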

Docs

  • CLAUDE.md env-var section now documents the per-tenant cap and the new label
  • docs/OPERATIONS.md defaults section updated; new alert query added under Observability for topk noisy tenants

Follow-ups (separate PRs)

  • Phase 3a: SQLite FTS5 + BM25 for log search (default storage)
  • Phase 3b: Postgres partitioning (opt-in adapter)
  • Phase 4: HTTP OTLP backpressure parity (HTTP 429 + Retry-After)
  • Phase 5: DROP-PARTITION retention (depends on 3b)
  • Phase 6: MCP HTTP streamable robustness for frequent queries

🤖 Generated with Claude Code

Phase 2 of the 150-200 component robustness work. Adds a per-tenant
series budget so a noisy tenant cannot exhaust the global TSDB
cardinality pool and starve siblings of fresh series.

Behavior is opt-in to preserve back-compat:
  - METRIC_MAX_CARDINALITY (existing, default 10000) — global series cap.
  - METRIC_MAX_CARDINALITY_PER_TENANT (new, default 0=unlimited) — when
    set, each tenant gets its own series budget.
  - Per-tenant cap is checked FIRST; global cap is the backstop.
  - Per-tenant overflow buckets are tenant-scoped (key suffix |<tenant>)
    so each tenant's overflow stats stay separate.

Telemetry surface change:
  - TSDBCardinalityOverflow (Counter) — kept for back-compat dashboards.
  - TSDBCardinalityOverflowByTenant (CounterVec, label tenant_id) — new.
    Sentinel "__global__" when the global cap (not per-tenant) triggered.
    Lets operators identify noisy tenants:
        sum by (tenant_id) (
          rate(otelcontext_tsdb_cardinality_overflow_by_tenant_total[5m])
        )

Aggregator API:
  - SetCardinalityLimit signature changed to (global, perTenant int,
    onOverflow func(tenantID string)). Sole external caller (main.go) is
    updated. Old single-arg callback shape is gone.
  - flush() resets seriesPerTenant alongside the buckets map so each
    new window starts every tenant with a fresh budget.
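
A toy model of the API shape and the flush reset described above, under stated assumptions: `miniAggregator`, `observe`, and the `__overflow__` base key are hypothetical; only the `SetCardinalityLimit(global, perTenant, onOverflow)` shape and the `seriesPerTenant` name come from this commit message. The sketch is single-goroutine and skips the locking a real aggregator would need.

```go
package main

import "fmt"

// miniAggregator is a hypothetical, single-goroutine stand-in for the real
// Aggregator; it tracks only what is needed to show the limit API.
type miniAggregator struct {
	global, perTenant int
	buckets           map[string]int // admitted series key -> sample count
	overflow          map[string]int // tenant-scoped overflow buckets
	seriesPerTenant   map[string]int // admitted series per tenant
	onOverflow        func(tenantID string)
}

func newMiniAggregator() *miniAggregator {
	return &miniAggregator{
		buckets:         map[string]int{},
		overflow:        map[string]int{},
		seriesPerTenant: map[string]int{},
	}
}

// SetCardinalityLimit mirrors the three-argument shape from this commit.
func (a *miniAggregator) SetCardinalityLimit(global, perTenant int, onOverflow func(tenantID string)) {
	a.global, a.perTenant, a.onOverflow = global, perTenant, onOverflow
}

// observe records one sample, admitting a new series only if both caps allow.
func (a *miniAggregator) observe(tenant, series string) {
	key := series + "|" + tenant
	if _, seen := a.buckets[key]; seen {
		a.buckets[key]++
		return
	}
	switch {
	case a.perTenant > 0 && a.seriesPerTenant[tenant] >= a.perTenant:
		a.spill(tenant, tenant) // per-tenant cap checked first
	case a.global > 0 && len(a.buckets) >= a.global:
		a.spill(tenant, "__global__") // global cap as backstop
	default:
		a.buckets[key] = 1
		a.seriesPerTenant[tenant]++
	}
}

// spill counts the sample in the tenant's overflow bucket and notifies the
// callback. Scoping the bucket by tenant in the global case too is an
// assumption of this sketch.
func (a *miniAggregator) spill(tenant, label string) {
	a.overflow["__overflow__|"+tenant]++
	if a.onOverflow != nil {
		a.onOverflow(label)
	}
}

// flush starts a new window: buckets and per-tenant counts are both reset.
func (a *miniAggregator) flush() {
	a.buckets = map[string]int{}
	a.overflow = map[string]int{}
	a.seriesPerTenant = map[string]int{}
}

func main() {
	overflows := map[string]int{} // stand-in for the labeled counter
	agg := newMiniAggregator()
	agg.SetCardinalityLimit(3, 2, func(tenantID string) { overflows[tenantID]++ })
	for _, s := range []string{"s1", "s2", "s3"} {
		agg.observe("a", s) // third series exceeds tenant a's budget
	}
	agg.observe("b", "s1")
	agg.observe("b", "s2") // within b's budget, but the global pool is full
	fmt.Println(overflows)
	agg.flush()
	agg.observe("a", "s3") // admitted again after the reset
	fmt.Println(agg.seriesPerTenant["a"])
}
```

In the real service the callback would presumably increment the labeled counter; here a plain map stands in for it so the example stays dependency-free.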

Tests cover: zero-config baseline, global-only legacy behavior, per-tenant
fairness (tenant A exhausts budget, tenant B unaffected), per-tenant
overflow buckets stay separate (no merge regression), flush resets
counts, both caps coexist with correct precedence, default behavior
unchanged when only global is set, overflow bucket stat accumulation.
8 tests, all pass under -race; full suite (13 packages) green.

Docs updated in CLAUDE.md (env-var section) and docs/OPERATIONS.md
(defaults section + new alert query under Observability).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@aksOps aksOps merged commit cf9c1f5 into main Apr 27, 2026
17 checks passed
@aksOps aksOps deleted the feat/per-tenant-cardinality branch April 27, 2026 16:22